The production and sale of wine is big business. In the USA alone, retail sales of wine reached nearly $56 billion in 2015, which equates to approximately 2.8 gallons consumed per resident per year (https://www.statista.com/topics/1709/alcoholic-beverages/). An important metric in a consumer’s decision about whether to buy a given bottle of wine is the ‘quality’ of that bottle. Quality is often summarized for the consumer as a wine score provided by an expert. The goal of this exploratory data analysis (EDA) is to determine whether any particular chemical properties of wine are good predictors of quality. The dataset used covers white wines from Portugal, as described in the following article:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
For each wine, an expert-derived quality score on a scale of 0 (very bad) to 10 (very excellent) was provided. In addition to the quality score, the dataset contains a measurement for 11 chemical properties of each wine, as will be described later.
The white wine dataset was downloaded from the following site:
https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityWhites.csv
The file was found to contain the following dataframe:
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
There are 4,898 observations, each with 12 variables of interest (X is simply a sequential index for each observation, from 1 to 4,898, and will be dropped from the dataset). There are 11 chemical properties (e.g. fixed acidity, volatile acidity, etc.) and 1 measure of quality. The main feature in the dataset is ‘quality’, since this is the ultimate measure of each wine and is the variable that one would like to predict.
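Dropping the index column can be sketched as follows. This is a minimal illustration, assuming a three-row, five-column excerpt typed in by hand (the `csv_text` string is hypothetical and stands in for the downloaded file, which is not read here):

```r
# Three rows in the layout of the listing above (only some columns shown);
# this inline text is a stand-in for the real CSV file
csv_text <- "X,fixed.acidity,volatile.acidity,alcohol,quality
1,7,0.27,8.8,6
2,6.3,0.3,9.5,6
3,8.1,0.28,10.1,6"

wine <- read.csv(text = csv_text)

# X is only a row counter, so drop it before any analysis
wine$X <- NULL
str(wine)
```

Assigning `NULL` to a column removes it in place, which avoids having to list the columns that are kept.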
As seen from the table above, R has interpreted ‘quality’ as type integer. I believe it is more appropriate to treat ‘quality’ as an ordinal variable (i.e. a categorical variable whose possible values are ordered) for portions of this analysis, so I will create a new variable, quality.cat, which transforms quality into an ordered categorical variable. The result of the transformation is shown below, confirming that ‘quality.cat’ is an ordered factor:
## Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
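One way to perform this kind of conversion is with `factor(..., ordered = TRUE)`. The sketch below assumes a short, hypothetical vector of scores rather than the full dataset; the levels 3 to 9 match the range seen in the output above:

```r
# Hypothetical integer scores standing in for the 'quality' column
quality <- c(6L, 6L, 5L, 7L, 3L, 9L, 8L, 4L)

# Convert to an ordered factor with one level per observed score (3..9)
quality.cat <- factor(quality, levels = 3:9, ordered = TRUE)
str(quality.cat)

# Ordered factors support comparisons that respect the level order
quality.cat[3] < quality.cat[4]   # is a score of 5 below a score of 7?
```

Keeping the original integer column alongside the factor allows numeric summaries (means, correlations) and categorical plots to be produced from the same data frame.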
A statistical summary of the data is given below:
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality quality.cat
## Min. :3.000 3: 20
## 1st Qu.:5.000 4: 163
## Median :6.000 5:1457
## Mean :5.878 6:2198
## 3rd Qu.:6.000 7: 880
## Max. :9.000 8: 175
## 9: 5
It is interesting to note that none of the wines received a perfect quality score (10), since the highest quality level observed in the data is 9, and none received a ‘very bad’ score (0) either, since the lowest level observed is 3. There are also very few samples at quality level 3 (only 20) or quality level 9 (only 5). This paucity of data at very low and very high quality scores might make it difficult to draw any statistically significant conclusions about the extremes of the quality scale.
At a glance, it appears there is a fair amount of spread in all of the variables, with meaningful differences between the min, median and max values. I will quantify one element of this spread by calculating the max:median ratio for each variable (excluding ‘quality.cat’):
## fixed.acidity volatile.acidity citric.acid
## 2.088235 4.230769 5.187500
## residual.sugar chlorides free.sulfur.dioxide
## 12.653846 8.046512 8.500000
## total.sulfur.dioxide density pH
## 3.283582 1.045525 1.201258
## sulphates alcohol quality
## 2.297872 1.365385 1.500000
There is a fair amount of spread within the variables, with many exhibiting max:median ratios in excess of 2 (i.e. a maximum more than 100% above the median). Density has the lowest ratio, with the maximum only about 4.6% higher than the median. It remains to be seen whether this seemingly large spread amongst most of the variables is helpful for predicting wine quality.
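The max:median calculation can be sketched with `sapply()`. The example assumes only the first ten rows of three columns, copied from the structure listing near the top, so its results will differ from the full-dataset ratios reported above:

```r
# First ten rows of three columns, copied from the structure listing
wine_head <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Apply the same max/median calculation to every column
max_median_ratio <- sapply(wine_head, function(x) max(x) / median(x))
max_median_ratio
```

The median is used as the denominator because it is less sensitive than the mean to the extreme values that the ratio is meant to highlight.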
Finally, I observe that one variable (citric.acid) has a minimum of zero. Is this a missing data point or a true measurement? I will keep this in mind, but will not do anything with this observation at the moment.
A good way to get an initial feel for the distribution of the data is via histograms. Rather than simply output 12 histograms, I will group the 12 properties into 3 categories and look at each category in turn. Since pH is a measure of acidity, I will group pH together with the graphs showing the 3 acid levels (fixed.acidity, volatile.acidity, and citric.acid). Next, I will group together the 5 remaining concentration measurements (residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, and sulphates). Finally, I will group together alcohol, density and quality.
(Note: a bar chart is used in the case of ‘quality.cat’, since it is categorical):
The quality rating appears to be approximately normally distributed, with the bulk of assessments in the middle bins. Density appears normal too, but with some positive skew. The alcohol content is an interesting one, possibly exhibiting bimodal or even trimodal behavior. Let’s take a closer look at density and alcohol content by replotting without the top 1% of values:
Density looks fairly normally distributed, whereas alcohol content might be bimodal or even trimodal, with ‘low’ (9% median), ‘medium’ (10.5% median) and ‘high’ (12% median) alcohol content populations.
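Excluding the top 1% of values before plotting can be sketched with `quantile()`. The data here are simulated (a right-skewed sample named `x`, which is hypothetical) and only illustrate the filtering step:

```r
set.seed(1)
# Simulated right-skewed values standing in for a column such as density
x <- rlnorm(1000, meanlog = 0, sdlog = 0.5)

# Keep only observations at or below the 99th percentile
cutoff  <- quantile(x, probs = 0.99)
trimmed <- x[x <= cutoff]

# Bin the trimmed values; use plot = TRUE to draw the histogram instead
h <- hist(trimmed, breaks = 30, plot = FALSE)
sum(h$counts)
```

Trimming in this way only changes what is displayed; the excluded observations remain in the dataset for all later calculations.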
A beer-related article I read inspired me to create a new variable for consideration in the analysis. The article can be found at:
http://beerandwinejournal.com/chloride-and-sulfate/
This article suggests that, at least in the context of beer brewing, the chloride-to-sulphate ratio might be a far more important measure of quality than the individual levels of either ion. Perhaps this ratio is important for wine too, so I will create a chloride-to-sulphate ratio variable.
In addition, I decided that the free-to-total sulfur dioxide ratio might be interesting (I have worked with pools, and in pool chemistry the free-to-total chlorine ratio is important for the quality of pool water, at least!). In a similar vein, I think the ratio of volatile acidity to fixed acidity might be important, since there might be a chemical interplay between the two forms of acidity. Finally, I decided to also create a sugar-to-alcohol ratio, since both variables exhibited strange, bimodal-like behavior and intuitively it seemed there might be some interplay here, with a sugary taste potentially masking the sometimes unpalatable taste of a higher alcohol content. The new ratios were created and their descriptive statistics and histograms (excluding the top 1% of values) are presented below:
## 'data.frame': 4898 obs. of 4 variables:
## $ chloride_to_sulphate : num 0.1 0.1 0.114 0.145 0.145 ...
## $ free_to_total_sulfure.dioxide: num 0.265 0.106 0.309 0.253 0.253 ...
## $ volatile_to_fixed_acidity : num 0.0386 0.0476 0.0346 0.0319 0.0319 ...
## $ sugar_to_alcohol : num 2.352 0.168 0.683 0.859 0.859 ...
## chloride_to_sulphate free_to_total_sulfure.dioxide
## Min. :0.02121 Min. :0.02362
## 1st Qu.:0.07143 1st Qu.:0.19093
## Median :0.08980 Median :0.25368
## Mean :0.09774 Mean :0.25558
## 3rd Qu.:0.11053 3rd Qu.:0.31579
## Max. :0.62708 Max. :0.71053
## volatile_to_fixed_acidity sugar_to_alcohol
## Min. :0.01111 Min. :0.0566
## 1st Qu.:0.03030 1st Qu.:0.1575
## Median :0.03836 Median :0.4906
## Mean :0.04126 Mean :0.6423
## 3rd Qu.:0.04848 3rd Qu.:0.9773
## Max. :0.18033 Max. :5.6239
The free:total sulfur dioxide graph looks normally distributed. The chloride:sulphate, volatile:fixed acidity and sugar:alcohol graphs look positively skewed. In addition, the sugar:alcohol graph exhibits the same potentially bimodal behavior exhibited by the sugar and the alcohol graphs.
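The four ratio variables can be sketched as simple column divisions. The example uses the first four rows of the relevant columns, copied from the structure listing near the top; the column names of the result follow the output above (including the spelling `free_to_total_sulfure.dioxide`):

```r
# First four rows of the relevant columns, copied from the structure listing
wine_head <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2),
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5),
  chlorides            = c(0.045, 0.049, 0.05, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186),
  sulphates            = c(0.45, 0.49, 0.44, 0.4),
  alcohol              = c(8.8, 9.5, 10.1, 9.9)
)

# Each derived variable is the element-wise ratio of two measured columns
ratios <- with(wine_head, data.frame(
  chloride_to_sulphate          = chlorides / sulphates,
  free_to_total_sulfure.dioxide = free.sulfur.dioxide / total.sulfur.dioxide,
  volatile_to_fixed_acidity     = volatile.acidity / fixed.acidity,
  sugar_to_alcohol              = residual.sugar / alcohol
))
round(ratios, 4)
```

Note that the ratios mix units (for example, residual sugar in g per dm^3 against alcohol in percent), so they are best read as relative indicators rather than physical quantities.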
I am now ready to look at the relationship between the various parameters.
I would like to start the bivariate analysis by looking at the correlation coefficients between the variables, as given below:
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## chloride_to_sulphate 0.02109576 0.05938449 0.088940695
## free_to_total_sulfure.dioxide -0.13945918 -0.19616085 0.016241396
## volatile_to_fixed_acidity -0.33775891 0.93662196 -0.245794409
## sugar_to_alcohol 0.09363299 0.04575732 0.102730408
## residual.sugar chlorides
## fixed.acidity 0.08902070 0.02308564
## volatile.acidity 0.06428606 0.07051157
## citric.acid 0.09421162 0.11436445
## residual.sugar 1.00000000 0.08868454
## chlorides 0.08868454 1.00000000
## free.sulfur.dioxide 0.29909835 0.10139235
## total.sulfur.dioxide 0.40143931 0.19891030
## density 0.83896645 0.25721132
## pH -0.19413345 -0.09043946
## sulphates -0.02666437 0.01676288
## alcohol -0.45063122 -0.36018871
## quality -0.09757683 -0.20993441
## chloride_to_sulphate 0.07800801 0.90145468
## free_to_total_sulfure.dioxide 0.05142979 -0.03321768
## volatile_to_fixed_acidity 0.01515041 0.04457791
## sugar_to_alcohol 0.99001187 0.11932114
## free.sulfur.dioxide total.sulfur.dioxide
## fixed.acidity -0.0493958591 0.091069756
## volatile.acidity -0.0970119393 0.089260504
## citric.acid 0.0940772210 0.121130798
## residual.sugar 0.2990983537 0.401439311
## chlorides 0.1013923521 0.198910300
## free.sulfur.dioxide 1.0000000000 0.615500965
## total.sulfur.dioxide 0.6155009650 1.000000000
## density 0.2942104109 0.529881324
## pH -0.0006177961 0.002320972
## sulphates 0.0592172458 0.134562367
## alcohol -0.2501039415 -0.448892102
## quality 0.0081580671 -0.174737218
## chloride_to_sulphate 0.0793673567 0.109439559
## free_to_total_sulfure.dioxide 0.7386321024 -0.013447850
## volatile_to_fixed_acidity -0.0848067079 0.039437265
## sugar_to_alcohol 0.3143238443 0.429487399
## density pH sulphates
## fixed.acidity 0.26533101 -0.4258582910 -0.01714299
## volatile.acidity 0.02711385 -0.0319153683 -0.03572815
## citric.acid 0.14950257 -0.1637482114 0.06233094
## residual.sugar 0.83896645 -0.1941334540 -0.02666437
## chlorides 0.25721132 -0.0904394560 0.01676288
## free.sulfur.dioxide 0.29421041 -0.0006177961 0.05921725
## total.sulfur.dioxide 0.52988132 0.0023209718 0.13456237
## density 1.00000000 -0.0935914935 0.07449315
## pH -0.09359149 1.0000000000 0.15595150
## sulphates 0.07449315 0.1559514973 1.00000000
## alcohol -0.78013762 0.1214320987 -0.01743277
## quality -0.30712331 0.0994272457 0.05367788
## chloride_to_sulphate 0.18691463 -0.1454060343 -0.36185381
## free_to_total_sulfure.dioxide -0.06552475 0.0008012900 -0.02236186
## volatile_to_fixed_acidity -0.07540469 0.1136748292 -0.02891024
## sugar_to_alcohol 0.87168339 -0.2013195265 -0.01803066
## alcohol quality
## fixed.acidity -0.12088112 -0.113662831
## volatile.acidity 0.06771794 -0.194722969
## citric.acid -0.07572873 -0.009209091
## residual.sugar -0.45063122 -0.097576829
## chlorides -0.36018871 -0.209934411
## free.sulfur.dioxide -0.25010394 0.008158067
## total.sulfur.dioxide -0.44889210 -0.174737218
## density -0.78013762 -0.307123313
## pH 0.12143210 0.099427246
## sulphates -0.01743277 0.053677877
## alcohol 1.00000000 0.435574715
## quality 0.43557472 1.000000000
## chloride_to_sulphate -0.30635643 -0.192803276
## free_to_total_sulfure.dioxide 0.06446642 0.197214077
## volatile_to_fixed_acidity 0.11281181 -0.141314426
## sugar_to_alcohol -0.53683146 -0.134750485
## chloride_to_sulphate
## fixed.acidity 0.02109576
## volatile.acidity 0.05938449
## citric.acid 0.08894070
## residual.sugar 0.07800801
## chlorides 0.90145468
## free.sulfur.dioxide 0.07936736
## total.sulfur.dioxide 0.10943956
## density 0.18691463
## pH -0.14540603
## sulphates -0.36185381
## alcohol -0.30635643
## quality -0.19280328
## chloride_to_sulphate 1.00000000
## free_to_total_sulfure.dioxide 0.00332709
## volatile_to_fixed_acidity 0.03753703
## sugar_to_alcohol 0.10129093
## free_to_total_sulfure.dioxide
## fixed.acidity -0.13945918
## volatile.acidity -0.19616085
## citric.acid 0.01624140
## residual.sugar 0.05142979
## chlorides -0.03321768
## free.sulfur.dioxide 0.73863210
## total.sulfur.dioxide -0.01344785
## density -0.06552475
## pH 0.00080129
## sulphates -0.02236186
## alcohol 0.06446642
## quality 0.19721408
## chloride_to_sulphate 0.00332709
## free_to_total_sulfure.dioxide 1.00000000
## volatile_to_fixed_acidity -0.13913499
## sugar_to_alcohol 0.04818126
## volatile_to_fixed_acidity sugar_to_alcohol
## fixed.acidity -0.337758911 0.093632991
## volatile.acidity 0.936621961 0.045757325
## citric.acid -0.245794409 0.102730408
## residual.sugar 0.015150413 0.990011869
## chlorides 0.044577906 0.119321139
## free.sulfur.dioxide -0.084806708 0.314323844
## total.sulfur.dioxide 0.039437265 0.429487399
## density -0.075404688 0.871683392
## pH 0.113674829 -0.201319527
## sulphates -0.028910236 -0.018030664
## alcohol 0.112811806 -0.536831461
## quality -0.141314426 -0.134750485
## chloride_to_sulphate 0.037537028 0.101290928
## free_to_total_sulfure.dioxide -0.139134993 0.048181261
## volatile_to_fixed_acidity 1.000000000 -0.002877822
## sugar_to_alcohol -0.002877822 1.000000000
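A correlation matrix of this kind can be produced with `cor()`. The sketch below uses simulated columns (the data frame `sim` and the coefficients used to generate it are hypothetical), so the numbers it prints are not comparable to the matrix above:

```r
set.seed(2)
n <- 200
# Simulated columns; the generating coefficients are arbitrary, not estimates
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.015 - 0.002 * alcohol + rnorm(n, sd = 0.0015)
pH      <- rnorm(n, mean = 3.19, sd = 0.15)
sim     <- data.frame(alcohol, density, pH)

# Pearson correlation matrix (the default method of cor())
corr <- cor(sim)
round(corr, 2)

# Correlations of every variable with a single column of interest
round(corr[, "alcohol"], 2)
```

Extracting a single column, as in the last line, is a convenient way to read off how every variable correlates with one target such as quality.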
Based on the correlations, it appears several chemicals negatively impact quality (correlations are shown in parentheses below):
Let’s create a new variable, ‘bad_solids’, that adds them together (they all have units g per dm^3). The new variable has the following statistics and correlation coefficient with quality:
## num [1:4898] 28.38 8.59 15.73 16.31 16.31 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.883 9.470 12.670 13.900 17.470 75.240
Correlation with quality:
## [1] -0.1175407
This new variable does not have a particularly strong correlation with quality. Its correlation coefficient (-0.12) is weaker than or basically equal to the individual correlations of many of its components. So this avenue looks like a dead end and I will not utilize this particular variable going forward.
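The summation can be sketched with `rowSums()`. The components of ‘bad_solids’ are not listed in the text; the sketch assumes they are fixed.acidity, volatile.acidity, citric.acid, residual.sugar and chlorides, because summing those five columns for the first four wines agrees with the first four values printed above to the displayed precision:

```r
# First four rows of the candidate component columns (all in g per dm^3)
wine_head <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23),
  citric.acid      = c(0.36, 0.34, 0.4, 0.32),
  residual.sugar   = c(20.7, 1.6, 6.9, 8.5),
  chlorides        = c(0.045, 0.049, 0.05, 0.058)
)

# ASSUMPTION: these five columns are the components of 'bad_solids'
bad_solids <- rowSums(wine_head)
bad_solids
```

Because residual sugar is numerically much larger than the other components, a plain sum is dominated by it, which may help explain why the combined variable correlates with quality about as weakly as residual sugar alone.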
I would like to narrow down the analysis to those variables that have a modest correlation with quality (say a coefficient with an absolute value on the order of 0.15). The list is as follows, with the correlations versus quality shown in parentheses:
The variables being dropped (since their correlations with quality aren’t high enough) are as follows:
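A threshold-based selection of this kind can be sketched as follows. The correlations are rounded from the quality column of the matrix above; the strict cutoff of 0.15 is an assumption made for illustration, and since the text says ‘on the order of 0.15’, the selection actually used may be looser (for example, it may retain sugar_to_alcohol):

```r
# Correlations with quality, rounded from the correlation matrix
cor_with_quality <- c(
  fixed.acidity = -0.114, volatile.acidity = -0.195, citric.acid = -0.009,
  residual.sugar = -0.098, chlorides = -0.210, free.sulfur.dioxide = 0.008,
  total.sulfur.dioxide = -0.175, density = -0.307, pH = 0.099,
  sulphates = 0.054, alcohol = 0.436, chloride_to_sulphate = -0.193,
  free_to_total_sulfure.dioxide = 0.197, volatile_to_fixed_acidity = -0.141,
  sugar_to_alcohol = -0.135
)

threshold <- 0.15  # ASSUMPTION: a strict cutoff, chosen for illustration
keep <- abs(cor_with_quality) >= threshold

sort(cor_with_quality[keep])     # variables retained
names(cor_with_quality)[!keep]   # variables dropped
```

Filtering on the absolute value keeps both positively and negatively correlated variables.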
A scatterplot matrix can be a helpful early step in EDA. This matrix will allow us to get a sense as to whether there are trends between various variables in the dataset. First, I’ll generate a scatterplot matrix using all the selected variables:
Although there appear to be some trends, the plot is too dense for any meaningful analysis, so I will split it up a bit. First, I will generate two scatterplot matrices that involve the primary feature of interest (quality):
The most interesting observations I glean from the scatterplots are as follows:
The higher quality categories (8,9) exhibit a lot less variance for most of the variables (e.g. if you look at the volatile.acidity, chlorides, density, total.sulfur.dioxide parameters, you notice far less variance in these parameters when the quality is a category 8 or 9).
There appears to be a meaningful, positive relationship between quality and alcohol percent that definitely warrants further investigation (0.44 correlation).
There appears to be a possibly linear relationship between density and alcohol percent, with a relatively strong correlation (-0.78).
Alcohol percent and total sulfur dioxide have an inverse correlation (-0.45) that might warrant further investigation.
Next, I will generate two more scatterplots that are focused on relationships between the non-quality variables:
The most interesting observations I glean from these additional scatterplots are as follows:
In these plots, I observed some very strong correlations (0.9+) between the ratio variables I created and their components (e.g. the chloride:sulphate ratio has a 0.90 correlation with the chloride level). Some correlation is obviously expected, since the derived variable contains the component variable. A correlation at this level suggests that the ratio is driven mostly by its numerator (chlorides and sulphates themselves are essentially uncorrelated, at 0.02), so it may carry little information beyond the component itself. Whether the ratios add anything for predicting quality remains to be tested.
The strongest correlation observed amongst variables that do not involve derivatives of themselves is the 0.87 observed between the sugar:alcohol ratio and density.
Let’s now take a closer look at some of the interesting bivariate pairs:
As can be seen from the graph, a linear approximation appears to do a reasonable job of describing the relationship between density and the sugar:alcohol ratio. This was the variable pair that exhibited the highest correlation coefficient (0.87) amongst the variables being considered here. As the ratio increases, so does density. If one knew that a certain wine had a sugar:alcohol ratio of 1.5, for example, one could likely make some reasonable predictions about what the density of that particular wine would be (e.g. 0.995 to 1.000, with a relatively high degree of confidence).
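A linear approximation of this kind, together with a prediction at a ratio of 1.5, can be sketched with `lm()` and `predict()`. The data are simulated (the slope, intercept and noise level are arbitrary illustrative values), so the interval printed is not an estimate for the real wines:

```r
set.seed(3)
n <- 300
# Simulated data; slope, intercept and noise are arbitrary illustrative values
sugar_to_alcohol <- runif(n, min = 0.1, max = 2)
density <- 0.991 + 0.004 * sugar_to_alcohol + rnorm(n, sd = 0.001)
sim <- data.frame(sugar_to_alcohol, density)

# Ordinary least squares fit of density on the ratio
fit <- lm(density ~ sugar_to_alcohol, data = sim)
coef(fit)

# 95% prediction interval for a new wine with a ratio of 1.5
pred <- predict(fit, newdata = data.frame(sugar_to_alcohol = 1.5),
                interval = "prediction", level = 0.95)
pred
```

A prediction interval (rather than a confidence interval for the mean) is the relevant quantity here, since the question is where the density of one particular wine is likely to fall.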
Density was also observed to have a strong inverse correlation with the alcohol content (-0.78). Let’s consider a graph of these two variables:
This inverse relationship is apparent from the graph, although at any given alcohol level, there is a fair amount of variability in the density value.
Alcohol content and total sulfur dioxide were also observed to be inversely correlated (-0.45). From the graph of these two variables, one can see the negative correlation, but there is so much variance that the relationship is unlikely to carry much predictive value. For example, if one knew the alcohol percent of a given wine was 12%, one would not be able to say with much certainty what the total sulfur dioxide level is likely to be, given how dispersed the points are. It looks just as likely to be 75 as 175 at a 12% alcohol level.
Next, let’s consider the relationship between the quality measurement and various parameters.
It appears that in general, higher quality wines have lower chloride:sulphate ratios, as exhibited by the decreasing median values as the quality category increases.
There does not appear to be any particularly promising trend between the volatile acidity level and quality, since the median volatile acidity rises and falls with changes in the quality category, with no apparent trend.
It appears that in general, higher quality wines have higher free:total sulfur dioxide ratios, since the median values appear to consistently increase as quality increases.
It appears that in general, higher quality wines have lower chloride levels, since the median value of chlorides drops with increasing quality.
The relationship between density and quality appears to be quite strong: higher quality wines (quality rating of 7 or higher) appear to be lower density compared to lower quality wines (quality rating of 5 or lower), based on the large differences in the median density observed between the quality extremes.
The relationship between alcohol content and quality appears potentially promising, particularly at the higher end of the quality scale, where there is a clear upwards trend in quality (from levels 5 through 9) as the median alcohol content increases.
It is hard to discern any clear trend between the sugar:alcohol ratio and a wine’s quality, given that the median values move up and down as the quality improves.
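The comparisons above rest on the median of each variable within each quality category, which can be computed with `tapply()`. The sketch uses a small hypothetical sample (nine made-up wines), not the real data:

```r
# Small hypothetical sample: alcohol content and quality score per wine
sim <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 9.8, 10.1, 10.4, 10.9, 11.2, 11.6, 12.0)
)
sim$quality.cat <- factor(sim$quality, ordered = TRUE)

# Median of the variable within each quality category
medians <- tapply(sim$alcohol, sim$quality.cat, median)
medians

# TRUE when the medians rise strictly with quality
all(diff(as.vector(medians)) > 0)
```

Checking whether the medians change monotonically is a quick numerical counterpart to reading the trend off a set of boxplots.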
I will now consider the interaction of multiple variables. First, it was observed in the bivariate analysis that there is a relatively strong inverse relationship between density and the alcohol content (correlation coefficient of -0.78). The quality levels can be layered onto that graph as well:
The quality points fall on the graph in a pattern that suggests the higher quality wines tend to have high alcohol content and also low density, as seen by the fact that the lower right portion of the graph is dominated by those wines in the 7-9 quality range, whereas the upper left portion is dominated by wines in the 3-5 quality range. Wines of quality 6 are dispersed throughout.
It was observed during the bivariate analysis that there was a strong correlation between the chloride level and the chloride:sulphate ratio. The quality levels can be layered onto that graph as well:
It appears there might be a tendency for high quality wines to have both low chloride levels and low chloride:sulphate ratios. Let’s zoom in on the lower left portion of the graph, which contains most of the data points, by excluding the top 5% of values for each variable:
There does indeed appear to be a tendency for the higher quality wines to be lower in chlorides and chloride:sulphate ratio, given that the quality 7-9 wines have tended to cluster in the lower left portion of the graph, whereas the quality 3-5 wines are more in the upper right portion.
Next, let’s consider volatile acidity and the free:total sulfur dioxide ratio. During the bivariate analysis, these two variables were each observed to have one of the stronger correlations with quality (-0.19 and 0.20 respectively), so the pair seems worth considering in a multivariate format too, where quality is layered on the graph:
When we do so, there is no strong pattern regarding where the higher versus lower quality wines fall on the graph. The quality points are dispersed throughout, even though there might be some weak relationships in terms of where they tend to fall.
Now let’s look at the bivariate pair that exhibited the highest correlation coefficient, namely density and the sugar:alcohol ratio, which had a correlation coefficient of 0.87. To deepen the insight into these two variables and how they might impact wine quality, let’s layer quality onto the graph:
A very interesting graph results, where there appears to be a strong tendency for the higher quality wines to cluster below the trendline whereas the lower quality wines tend to cluster above the trendline. In other words, for a given sugar:alcohol ratio, higher quality wines tend to be less dense, and above a certain sugar:alcohol ratio (approximately 1.5), there appear to be very few good quality wines.
It was observed at the very beginning of the analysis that one drawback of this dataset is the relatively small number of samples for wines at the extreme ends of the quality spectrum. For example, of the nearly 5,000 wines in the dataset, there were zero wines of qualities 0, 1, 2 or 10. There were only 20 wines of quality 3 and only 5 wines of quality 9. Given the tiny number of samples on the extremes of the quality spectrum, it is possible that the dataset is being partitioned too finely. This seems particularly possible given that ‘quality’ is ultimately an expert’s judgement call rather than an easy-to-measure number, so one might expect a legitimate quality level 7 wine to be tagged as a 6 or an 8, depending on which expert makes the judgement.
To address this, I would like to consider how things might look if the quality categories are more ‘coarse’ and hence each category has many more samples. To do so, let’s consider any wine with a 3-5 rating as ‘bad’, a wine with a 6 rating as ‘ok’ and a wine with a 7-9 rating as ‘good’. When the data is split along these lines, one obtains the following sample count per category:
## bad ok good
## 1640 2198 1060
Thus, after completing this transformation, each of the quality buckets has a significant number of data points (>1,000 per category), which might improve the ability to draw conclusions regarding trends in the data. Now let’s consider whether there are trends between the variables of interest and these new quality categories:
With this coarser partitioning of quality, the trends become more consistent and we can draw some conclusions about the impact of the various chemical properties on quality:
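The regrouping into three categories can be sketched with `cut()`. The example rebuilds a quality vector from the per-score counts in the statistical summary near the top and bins it at the boundaries described in the text; the variable name `good_bad` follows the model call later in the document:

```r
# Rebuild the quality column from the per-score counts in the summary
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Intervals are right-closed: (2,5] -> bad, (5,6] -> ok, (6,9] -> good
good_bad <- cut(quality, breaks = c(2, 5, 6, 9),
                labels = c("bad", "ok", "good"), ordered_result = TRUE)
table(good_bad)
```

Setting `ordered_result = TRUE` keeps the new variable ordinal, so it can be used as the response in an ordered regression in the same way as the finer quality categories.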
The following variables correlate inversely with quality (i.e. quality decreases as these variables increase in value):
The following variables correlate with quality (i.e. quality increases as these variables increase in value):
Let’s now overlay the new quality categories on the density vs. alcohol content graph:
The categories split quite well: good wines tend to have higher alcohol content and lower density levels.
Looking at density vs. sugar:alcohol ratio in the context of these new quality categories, we observe the following:
Here the split appears even stronger. The good wines cluster at lower density and lower sugar:alcohol levels. Further, at a given sugar:alcohol ratio, the good quality wines tend to have lower densities than the bad quality wines.
Given the correlations and trends observed between wine quality and the measurable properties of a given wine, the next task was to build a model for predicting the quality level of any given wine. One could imagine such a model being very useful for wine manufacturers, in terms of monitoring the properties of the wines being produced by any given vineyard at any point in time, in an attempt to improve the likely quality outcome.
Since the variable being predicted by the model is ordinal, an appropriate modeling technique is ordered logistic regression. The polr function from the MASS package serves this purpose, and the results of the model are shown below for the case where the quality categories 3-9 are the desired prediction outcome:
## Call:
## polr(formula = quality.cat ~ alcohol + density + chlorides +
## free_to_total_sulfure.dioxide + volatile.acidity + chloride_to_sulphate +
## sugar_to_alcohol, data = data_subset, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## alcohol 0.8542 0.03167 26.975
## density -111.8724 0.20976 -533.324
## chlorides 8.7427 3.00925 2.905
## free_to_total_sulfure.dioxide 2.6950 0.31039 8.682
## volatile.acidity -4.9135 0.29970 -16.395
## chloride_to_sulphate -5.2279 1.26377 -4.137
## sugar_to_alcohol 0.9801 0.06205 15.796
##
## Intercepts:
## Value Std. Error t value
## 3|4 -108.6419 0.2241 -484.8866
## 4|5 -106.2938 0.2268 -468.7052
## 5|6 -103.2812 0.2345 -440.4678
## 6|7 -100.7124 0.2508 -401.6273
## 7|8 -98.4732 0.2679 -367.5771
## 8|9 -94.7995 0.5169 -183.3921
##
## Residual Deviance: 10950.99
## AIC: 10976.99
## [1] "Confidence Levels:"
## 2.5 % 97.5 %
## alcohol 0.7921375 0.9162693
## density -112.2835616 -111.4612995
## chlorides 2.8446915 14.6407329
## free_to_total_sulfure.dioxide 2.0866125 3.3033379
## volatile.acidity -5.5008794 -4.3260881
## chloride_to_sulphate -7.7048769 -2.7509715
## sugar_to_alcohol 0.8585068 1.1017364
A model can also be built for the scenario where the ‘transformed’ quality categories of ‘bad’, ‘ok’, and ‘good’ are the desired prediction outcome, and those modeling results are as follows:
## Call:
## polr(formula = good_bad ~ alcohol + density + chlorides + free_to_total_sulfure.dioxide +
## volatile.acidity + chloride_to_sulphate + sugar_to_alcohol,
## data = data_subset, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## alcohol 0.8785 0.03354 26.189
## density -113.3714 0.19014 -596.258
## chlorides 10.3102 3.20688 3.215
## free_to_total_sulfure.dioxide 2.5025 0.31811 7.867
## volatile.acidity -4.9631 0.32957 -15.059
## chloride_to_sulphate -6.1117 1.32626 -4.608
## sugar_to_alcohol 0.9574 0.06489 14.754
##
## Intercepts:
## Value Std. Error t value
## bad|ok -104.6088 0.2124 -492.4536
## ok|good -102.0281 0.2313 -441.0191
##
## Residual Deviance: 8700.39
## AIC: 8718.39
## [1] "Confidence Levels:"
## 2.5 % 97.5 %
## alcohol 0.8127184 0.9442035
## density -113.7440702 -112.9987423
## chlorides 4.0248603 16.5955912
## free_to_total_sulfure.dioxide 1.8789933 3.1259491
## volatile.acidity -5.6090258 -4.3171442
## chloride_to_sulphate -8.7111218 -3.5122790
## sugar_to_alcohol 0.8301835 1.0845497
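The general shape of such a fit can be sketched on simulated data. Everything generated below (the data frame `sim`, the two predictors chosen, and the effect sizes) is hypothetical, and the Wald intervals from `confint.default()` are one common choice; the interval method behind the output above is not stated:

```r
library(MASS)  # provides polr()

set.seed(4)
n <- 1000
# Simulated predictors; the effect sizes below are arbitrary, not estimates
alcohol          <- rnorm(n, mean = 10.5, sd = 1.2)
volatile.acidity <- rnorm(n, mean = 0.28, sd = 0.1)
latent <- 0.9 * alcohol - 5 * volatile.acidity + rlogis(n)

# Bin the latent score into three ordered categories
good_bad <- cut(latent, breaks = quantile(latent, c(0, 0.35, 0.8, 1)),
                labels = c("bad", "ok", "good"),
                include.lowest = TRUE, ordered_result = TRUE)
sim <- data.frame(good_bad, alcohol, volatile.acidity)

# Hess = TRUE stores the Hessian so summary() can report standard errors
fit <- polr(good_bad ~ alcohol + volatile.acidity, data = sim, Hess = TRUE)
summary(fit)

# Wald-type 95% confidence intervals for the coefficients
confint.default(fit)
```

In a proportional odds model, a positive coefficient means that higher values of the predictor shift a wine toward the higher categories; exponentiating a coefficient gives the corresponding odds ratio.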
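In both models, every coefficient is estimated with reasonable precision, with the ratio of the estimate to its standard error (i.e. the t-value) exceeding 2.9 in absolute value for all parameters. The estimates for alcohol content and density had the largest absolute t-values in both models, which is not surprising given the trends observed in the multivariate graphs, where these two properties were key predictors of a given wine’s quality score. Note that large t-values indicate that the individual coefficients differ from zero; they do not by themselves show how well the model predicts quality.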
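The t-values are simply each estimate divided by its standard error, which can be checked against the first model summary. The p-values in the sketch rely on a large-sample normal approximation, which is an assumption added here and is not part of the output above:

```r
# Estimates and standard errors copied from the first model summary
est <- c(alcohol = 0.8542, density = -111.8724, chlorides = 8.7427,
         free_to_total_sulfure.dioxide = 2.6950, volatile.acidity = -4.9135,
         chloride_to_sulphate = -5.2279, sugar_to_alcohol = 0.9801)
se  <- c(0.03167, 0.20976, 3.00925, 0.31039, 0.29970, 1.26377, 0.06205)

# t-value = estimate / standard error
t_value <- est / se

# ASSUMPTION: two-sided p-values from a normal approximation
p_value <- 2 * pnorm(abs(t_value), lower.tail = FALSE)

data.frame(estimate = est, std_error = se,
           t_value = round(t_value, 2), p_value = signif(p_value, 2))
```

Small differences from the printed t-values are expected, because the estimates and standard errors copied here are already rounded.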
Both models have limitations, however. First, they are only valid for the quality range exhibited in the dataset. Since the dataset only contained wines in the 3-9 quality range, these models would be unreliable at identifying wines outside of this range. Second, the models are only valid for the particular wine under consideration here (i.e. Portuguese “Vinho Verde” wines). A new model would likely be needed for each wine variety, or at the very least, this model would need to be validated against a new set of data before one could make any claims about its applicability beyond this particular dataset and wine variety.
In this section, three particularly interesting graphs that help summarize the key findings from the EDA are presented.
This plot demonstrates that in general, the high quality wines (quality 7-9) tend to have high alcohol content and low density, as shown by the preponderance of green shaded points in the lower right quadrant of the graph. Conversely, the poor quality wines (quality 3-5) tend to have low alcohol content and high density, dominating the two left side quadrants.
This plot demonstrates that once wine quality is transformed into coarser bins (i.e. ‘bad’, ‘ok’ and ‘good’ instead of integers 3-9), consistent trends emerge in the impact of various chemical properties on wine quality. Specifically, wine quality increases as the density and the sugar:alcohol ratio decrease, and as the percent alcohol increases.
This plot summarizes the key findings from the EDA exercise: at a given sugar:alcohol level, high quality wines tend to have lower densities than low quality wines. Further, beyond a certain sugar:alcohol ratio (approximately 1.0 - 1.5) there is a preponderance of bad quality wines compared to good quality wines.
One major struggle in this EDA was getting my head around all the variables and their impact on quality. It wasn’t just the large number of variables that was problematic, but the fact that many of these variables are likely related. For example, if one assumes that wine fermentation involves the breaking down of ‘sugar’ to create alcohol, then the residual sugar level in the wine and the percent of alcohol must impact each other. Similarly, my basic knowledge of chemistry suggests that many of these variables could be related through chemical reactions in many possible combinations. So, it felt a little bit like having a machine with many handles, where pulling one handle would result in a different handle also moving. To move forward, I elected to focus on those variables that had a correlation with quality above a certain threshold.
Despite this struggle, the sheer act of playing with the data in multiple ways and combinations began to yield insights into the relationships between the variables, and this felt like a success to me. I was also satisfied to be able to generate a regression model that intuitively made sense to me (i.e. the directional impact of most variables on quality was consistent with the EDA) and also had sensible parameters, with relatively high t-values. The positive coefficients for chlorides and the sugar:alcohol ratio are exceptions, possibly because these variables are strongly correlated with other predictors in the model.
I think the most interesting area for future exploration with this dataset would be to utilize machine learning techniques to see if a predictive model can be built that is superior to the regression model built here. This dataset looks ripe for machine learning, given the potentially complex interplay between the various chemical properties being measured. It would be helpful to have a bigger dataset if one were to apply machine learning and also a much larger number of datapoints from the extremes of the quality spectrum (i.e. many more quality 0-3 and quality 9-10 wines).
A second area of exploration would be to test whether the trends observed here hold true for other wine types. For example, a similar dataset exists for red wines, so analysing this red wine set might be a great option for taking this analysis to the next level.